This is the R Markdown document for hw-05, which focusses on factor and figure management exercises from the gapminder dataset.
Here are the goals of the assignment:
For more information and details on the assignment, please click here.
This report will be broken down into four parts:
For simple reshaping, gather() and spread() from tidyr help produce a tidier dataset. For data joining, there are two data sources and information from both is needed, so several join functions will be used to combine information from both datasets into a single new object.
All of the functions used for data reshaping and data joining are available in the tidyverse package. For full information about hw-04, please visit here.
As usual, we will begin by loading the packages needed for the analysis (tidyverse for functions and gapminder as the dataset of interest). If these packages are being used for the first time on your RStudio client, they will need to be installed before they can be loaded.
# This report will use the following packages; install any that are missing so they can be loaded later in the report
# install.packages('tidyverse')
# install.packages('gapminder')
# install.packages('readr')
# install.packages('scales')
# install.packages('plotly')
# install.packages('svglite')
# Load tidyverse and gapminder
library(tidyverse)
library(gapminder)
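One way to make the install step conditional is to check which packages are missing before calling install.packages(). This is only a sketch of a possible workflow, not part of the original report; `needed`, `is_installed` and `missing_pkgs` are illustrative names.

```r
# Packages listed in the install comments above
needed <- c("tidyverse", "gapminder", "readr", "scales", "plotly", "svglite")

# requireNamespace() returns FALSE for a package that is not installed
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

# Show what is missing; uncomment the last line to install those packages
print(missing_pkgs)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```

The install line is left commented out so that knitting the report never triggers an installation by surprise.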
For this section, the focus is to drop Oceania and reorder the levels of continent.
For this sub-section, the focus will be to:
First, we will assign the original gapminder data to an object (gm_orig), then filter out all observations involving Oceania and assign the result to a new object (gm_filter_O).
gm_orig <- gapminder # assign original gapminder dataset to an object
gm_filter_O <- gm_orig %>% # filter out Oceania continent observations into gm_filter_O
  filter(continent %in% c("Americas", "Africa", "Asia", "Europe"))
# Let's view the structure of the original gapminder dataset and of the dataset with Oceania observations filtered out:
str(gm_orig)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
str(gm_filter_O)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
As observed, filtering out the observations involving Oceania removed 24 rows (1704 down to 1680). Even though those observations are no longer in gm_filter_O, the structure of continent in both datasets still shows all 5 levels. This is also true for country, which still has all 142 levels (unique countries).
Levels don’t always have to be present in the factor or observed in the dataset. So even though we lost 24 observations because we filtered out observations involving Oceania, we need to drop the unused factor level “Oceania” as well.
There are two ways to go about this: we can either drop all unused factor levels with droplevels(), or drop unused levels for specific variables with fct_drop().
gm_filter_O %>%
  droplevels() %>% # drop all unused factor levels
  str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
gm_filter_O %>%
  mutate(continent = fct_drop(continent)) %>% # only drop unused levels in continent
  str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
The droplevels() function removed all unused levels from country and continent, the only two factor variables in the gapminder dataset, so we are left with 140 unique countries and 4 continents. This tells us that Oceania contained two countries. The fct_drop() call only dropped the unused levels of continent, so we have 4 continents but still 142 country levels, because the unused Oceania countries were not dropped in this scenario.
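The difference between the two approaches can be checked on a tiny example. The data frame below is made up for illustration (it is not the gapminder data), and `toy`, `no_oceania`, `all_dropped` and `one_dropped` are hypothetical names.

```r
library(forcats)

# Toy data: two factor columns, mimicking country and continent
toy <- data.frame(
  country   = factor(c("Chile", "Japan", "Fiji")),
  continent = factor(c("Americas", "Asia", "Oceania"))
)

# Filtering removes the row but keeps every level
no_oceania <- toy[toy$continent != "Oceania", ]
nlevels(no_oceania$continent)   # 3
nlevels(no_oceania$country)     # 3

# droplevels() cleans every factor column in the data frame
all_dropped <- droplevels(no_oceania)

# fct_drop() cleans only the factor it is given
one_dropped <- no_oceania
one_dropped$continent <- fct_drop(one_dropped$continent)

nlevels(all_dropped$country)    # 2
nlevels(one_dropped$country)    # 3, "Fiji" is still a level
nlevels(one_dropped$continent)  # 2
```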
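Here is a summary of the number of rows and the levels of the affected factors before and after filtering the dataset, taken from the str() output above.
Original gapminder dataset: 1704 rows, 5 continent levels, 142 country levels
Gapminder dataset with Oceania filtered out: 1680 rows, 5 continent levels, 142 country levels
Unused factor levels dropped with droplevels(): 1680 rows, 4 continent levels, 140 country levels
Unused factor levels dropped with fct_drop(continent): 1680 rows, 4 continent levels, 142 country levels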
For this section, I will focus the data primarily on countries from Europe in 2007
For this section, we are tasked with using the forcats package to change the order of the factor levels, based on a summary of one of the quantitative variables.
# Filter gapminder to European countries in 2007, then drop all unused factor levels from country and continent
gm_Europe_2007 <- gapminder %>%
  filter(continent == "Europe", year == 2007) %>% # year is an integer, so compare with a number
  droplevels() # drops all factor levels not pertaining to Europe
nlevels(gm_Europe_2007$country) # Number of levels / unique countries in Europe.
## [1] 30
After filtering our dataset to include only European countries, we found ourselves with 30 unique countries.
Let’s say I am interested in the population of each country in our gapminder Europe dataset. We can use our usual ggplot with geom_point to plot the population of each country.
# Create point plot with the population of each country in Europe, 2007
gm_Europe_2007 %>%
  ggplot(aes(pop, country)) +
  geom_point(aes(colour=country)) + # colour each country differently
  theme_bw() + # give the graph a white background
  labs(x = "Population", y = "Country", # add labels
       title = "Scatterplot for Population by European Countries, in 2007",
       caption = "Figure 1. Scatterplot of European country vs population")
This graph is not very informative, as the points show no visible pattern. By default, ggplot2 draws a factor in the order of its levels, and the levels of country are alphabetical. To fix this, we can reorder the levels so that the plot runs from lowest to highest population.
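The link between level order and plotting order can be seen on a small factor. The values below are invented for illustration and are not gapminder figures; `by_pop` and `by_pop_desc` are hypothetical names.

```r
library(forcats)

# Toy values (made up): three countries and a population in millions
country <- factor(c("Norway", "Austria", "Spain"))
pop     <- c(5, 8, 40)

# Default level order is alphabetical
levels(country)      # "Austria" "Norway" "Spain"

# fct_reorder() sorts the levels by another variable (ascending by default)
by_pop <- fct_reorder(country, pop)
levels(by_pop)       # "Norway" "Austria" "Spain"

# .desc = TRUE reverses the order
by_pop_desc <- fct_reorder(country, pop, .desc = TRUE)
levels(by_pop_desc)  # "Spain" "Austria" "Norway"
```

Only the level order changes; the values stored in the factor stay the same, which is why the reordering can be done inside mutate() without touching the rows.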
The plots produced in this section (Figure 2) will be raw, without many edits. Part 3: Visualization Design will add more detail to this plot (heading, background, axis labels, fig.height, fig.width, legend, etc.).
# Create point plot with lowest to highest population in Europe, 2007
gm_Europe_2007 %>%
  mutate(country = fct_reorder(country, pop)) %>% # points are from lowest to highest pop
  ggplot(aes(pop, country)) +
  geom_point(aes(colour=country)) + # colour each country differently
  labs(caption = "Figure 2. Scatter ordered by increasing population")
This looks much better, and it allows us to follow the countries as population increases. We can also quickly spot the countries with the lowest and highest populations. To make it even easier to read, we can draw a horizontal bar chart listing the countries from lowest to highest population:
gm_Europe_2007 %>%
  mutate(country = fct_reorder(country, pop)) %>% # bars are from lowest to highest pop
  ggplot(aes(country,pop)) +
  geom_bar(aes(fill=country), stat="identity") + # fill countries by different colours
  coord_flip() + # flip country into y-axis, pop to x-axis
  theme_bw() + # give the graph a white background
  labs(x = "Country", y = "Population", # labels follow the aesthetics, so x is still country after the flip
       title = "Barchart by Increasing Population by European Countries",
       caption = "Figure 3. Bar chart: Ordered by increasing population")
Now let's say we are interested in France in particular and want to compare it with the rest of the data. We can reorder the factor so that France is the first level, which places it at the bottom of the plot, followed by the rest of the countries. We will present this in a horizontal bar chart.
gm_France <- gm_Europe_2007$country %>%
  fct_relevel("France") # put France as the first level of the country factor
gm_Europe_2007$country <- gm_France # store the new level order in the main dataset
# Create bar chart with France as the first observation on the bottom, 2007
gm_Europe_2007 %>%
  ggplot(aes(country,pop)) +
  geom_bar(aes(fill=country), stat="identity") + # fill countries by different colours
  coord_flip() + # flip country into y-axis, pop to x-axis
  theme_bw() + # give the graph a white background
  labs(x = "Country", y = "Population", # labels follow the aesthetics, so x is still country after the flip
       title = "Barchart for Population with France as First Observation",
       caption = "Figure 4. Bar chart: France vs other European populations")
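The relevelling step behind Figure 4 can be checked on a small factor. The three countries below are an arbitrary illustration, and `fr_first` and `fr_last` are hypothetical names; `after = Inf` is the documented forcats way to move a level to the end.

```r
library(forcats)

country <- factor(c("Austria", "France", "Spain"))

# Move "France" to the front; the other levels keep their relative order
fr_first <- fct_relevel(country, "France")
levels(fr_first)   # "France" "Austria" "Spain"

# Move "France" to the end instead
fr_last <- fct_relevel(country, "France", after = Inf)
levels(fr_last)    # "Austria" "Spain" "France"

# The underlying values are untouched, only the level order changes
as.character(fr_first)
```

With coord_flip(), the first level is drawn at the bottom of the chart, so moving France to the end would place it at the top instead.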
For the file import and export section, I will experiment with the write_csv()/read_csv() and saveRDS()/readRDS() functions. The idea is to change the format of our 2007 European gapminder dataset, such as the order in which the rows are displayed, then save it and load it back into RStudio to see whether the format remains.
The European 2007 gapminder dataset is ordered alphabetically by country. For this sub-section, I will arrange the dataset by increasing life expectancy.
library(readr) # load read_csv and write_csv
# Arrange the 2007 European dataset by life expectancy (lowest to highest)
gm_reorder_write <- gm_Europe_2007 %>%
arrange(lifeExp)
gm_reorder_write %>%
knitr::kable(caption = "This table summarizes gapminder European countries in 2007, ordered by increasing life expectancy")
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Turkey | Europe | 2007 | 71.777 | 71158647 | 8458.276 |
| Romania | Europe | 2007 | 72.476 | 22276056 | 10808.476 |
| Bulgaria | Europe | 2007 | 73.005 | 7322858 | 10680.793 |
| Hungary | Europe | 2007 | 73.338 | 9956108 | 18008.944 |
| Serbia | Europe | 2007 | 74.002 | 10150265 | 9786.535 |
| Montenegro | Europe | 2007 | 74.543 | 684736 | 9253.896 |
| Slovak Republic | Europe | 2007 | 74.663 | 5447502 | 18678.314 |
| Bosnia and Herzegovina | Europe | 2007 | 74.852 | 4552198 | 7446.299 |
| Poland | Europe | 2007 | 75.563 | 38518241 | 15389.925 |
| Croatia | Europe | 2007 | 75.748 | 4493312 | 14619.223 |
| Albania | Europe | 2007 | 76.423 | 3600523 | 5937.030 |
| Czech Republic | Europe | 2007 | 76.486 | 10228744 | 22833.309 |
| Slovenia | Europe | 2007 | 77.926 | 2009245 | 25768.258 |
| Portugal | Europe | 2007 | 78.098 | 10642836 | 20509.648 |
| Denmark | Europe | 2007 | 78.332 | 5468120 | 35278.419 |
| Ireland | Europe | 2007 | 78.885 | 4109086 | 40675.996 |
| Finland | Europe | 2007 | 79.313 | 5238460 | 33207.084 |
| Germany | Europe | 2007 | 79.406 | 82400996 | 32170.374 |
| United Kingdom | Europe | 2007 | 79.425 | 60776238 | 33203.261 |
| Belgium | Europe | 2007 | 79.441 | 10392226 | 33692.605 |
| Greece | Europe | 2007 | 79.483 | 10706290 | 27538.412 |
| Netherlands | Europe | 2007 | 79.762 | 16570613 | 36797.933 |
| Austria | Europe | 2007 | 79.829 | 8199783 | 36126.493 |
| Norway | Europe | 2007 | 80.196 | 4627926 | 49357.190 |
| Italy | Europe | 2007 | 80.546 | 58147733 | 28569.720 |
| France | Europe | 2007 | 80.657 | 61083916 | 30470.017 |
| Sweden | Europe | 2007 | 80.884 | 9031088 | 33859.748 |
| Spain | Europe | 2007 | 80.941 | 40448191 | 28821.064 |
| Switzerland | Europe | 2007 | 81.701 | 7554661 | 37506.419 |
| Iceland | Europe | 2007 | 81.757 | 301931 | 36180.789 |
# save gm_reorder to project working directory
write_csv(gm_reorder_write, "gm_reorder_write.csv", col_names = TRUE)
# read gm_reorder to global environment and check if the countries are still listed by increasing life expectancy
gm_reorder_read <- read_csv("gm_reorder_write.csv")
## Parsed with column specification:
## cols(
## country = col_character(),
## continent = col_character(),
## year = col_integer(),
## lifeExp = col_double(),
## pop = col_integer(),
## gdpPercap = col_double()
## )
gm_reorder_read %>%
knitr::kable(caption = "This table summarizes gapminder European countries in 2007, ordered by increasing life expectancy")
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Turkey | Europe | 2007 | 71.777 | 71158647 | 8458.276 |
| Romania | Europe | 2007 | 72.476 | 22276056 | 10808.476 |
| Bulgaria | Europe | 2007 | 73.005 | 7322858 | 10680.793 |
| Hungary | Europe | 2007 | 73.338 | 9956108 | 18008.944 |
| Serbia | Europe | 2007 | 74.002 | 10150265 | 9786.535 |
| Montenegro | Europe | 2007 | 74.543 | 684736 | 9253.896 |
| Slovak Republic | Europe | 2007 | 74.663 | 5447502 | 18678.314 |
| Bosnia and Herzegovina | Europe | 2007 | 74.852 | 4552198 | 7446.299 |
| Poland | Europe | 2007 | 75.563 | 38518241 | 15389.925 |
| Croatia | Europe | 2007 | 75.748 | 4493312 | 14619.223 |
| Albania | Europe | 2007 | 76.423 | 3600523 | 5937.030 |
| Czech Republic | Europe | 2007 | 76.486 | 10228744 | 22833.309 |
| Slovenia | Europe | 2007 | 77.926 | 2009245 | 25768.258 |
| Portugal | Europe | 2007 | 78.098 | 10642836 | 20509.648 |
| Denmark | Europe | 2007 | 78.332 | 5468120 | 35278.419 |
| Ireland | Europe | 2007 | 78.885 | 4109086 | 40675.996 |
| Finland | Europe | 2007 | 79.313 | 5238460 | 33207.084 |
| Germany | Europe | 2007 | 79.406 | 82400996 | 32170.374 |
| United Kingdom | Europe | 2007 | 79.425 | 60776238 | 33203.261 |
| Belgium | Europe | 2007 | 79.441 | 10392226 | 33692.605 |
| Greece | Europe | 2007 | 79.483 | 10706290 | 27538.412 |
| Netherlands | Europe | 2007 | 79.762 | 16570613 | 36797.933 |
| Austria | Europe | 2007 | 79.829 | 8199783 | 36126.493 |
| Norway | Europe | 2007 | 80.196 | 4627926 | 49357.190 |
| Italy | Europe | 2007 | 80.546 | 58147733 | 28569.720 |
| France | Europe | 2007 | 80.657 | 61083916 | 30470.017 |
| Sweden | Europe | 2007 | 80.884 | 9031088 | 33859.748 |
| Spain | Europe | 2007 | 80.941 | 40448191 | 28821.064 |
| Switzerland | Europe | 2007 | 81.701 | 7554661 | 37506.419 |
| Iceland | Europe | 2007 | 81.757 | 301931 | 36180.789 |
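As observed, the row order of the re-ordered table (by increasing life expectancy) was maintained when we wrote the file and read it back into our global environment. Note, however, that the column specification shows country and continent were read back as character columns rather than factors, so the factor levels and their order do not survive the CSV round trip.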
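A small round trip makes it easy to see what a CSV file does and does not keep. This is a sketch on invented data written to a temporary file; `toy`, `csv_path`, `back` and `restored` are hypothetical names, and the life expectancy values are made up.

```r
library(readr)

# Toy data with a deliberately non-alphabetical level order
toy <- data.frame(
  country = factor(c("Spain", "France", "Italy"),
                   levels = c("Spain", "France", "Italy")),
  lifeExp = c(80.9, 80.7, 80.5)
)

csv_path <- tempfile(fileext = ".csv")
write_csv(toy, csv_path)
back <- read_csv(csv_path)

# The row order survives, but the factor comes back as plain text
class(back$country)                                  # "character"
identical(back$country, as.character(toy$country))   # TRUE

# The level order has to be re-applied by hand after reading
restored <- factor(back$country, levels = levels(toy$country))
levels(restored)                                     # "Spain" "France" "Italy"
```

A CSV file only stores text, so anything that lives in R attributes, such as factor levels, has to be rebuilt after import.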
Next, we try the same procedure with saveRDS() and readRDS().
gm_reorder_save <- gm_Europe_2007 %>%
arrange(lifeExp)
# head of the reordered table
head(gm_reorder_save) %>%
knitr::kable(caption = "This table summarizes gapminder European countries in 2007, ordered by increasing life expectancy")
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Turkey | Europe | 2007 | 71.777 | 71158647 | 8458.276 |
| Romania | Europe | 2007 | 72.476 | 22276056 | 10808.476 |
| Bulgaria | Europe | 2007 | 73.005 | 7322858 | 10680.793 |
| Hungary | Europe | 2007 | 73.338 | 9956108 | 18008.944 |
| Serbia | Europe | 2007 | 74.002 | 10150265 | 9786.535 |
| Montenegro | Europe | 2007 | 74.543 | 684736 | 9253.896 |
# save file to project directory via saveRDS (using the conventional .rds extension, since this is not a CSV file)
saveRDS(gm_reorder_save, "gm_reorder_save.rds")
# read file and look at the head of the reordered table again
gm_reorder_RDS <- readRDS("gm_reorder_save.rds")
head(gm_reorder_RDS) %>%
knitr::kable(caption = "This table summarizes gapminder European countries in 2007, ordered by increasing life expectancy")
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Turkey | Europe | 2007 | 71.777 | 71158647 | 8458.276 |
| Romania | Europe | 2007 | 72.476 | 22276056 | 10808.476 |
| Bulgaria | Europe | 2007 | 73.005 | 7322858 | 10680.793 |
| Hungary | Europe | 2007 | 73.338 | 9956108 | 18008.944 |
| Serbia | Europe | 2007 | 74.002 | 10150265 | 9786.535 |
| Montenegro | Europe | 2007 | 74.543 | 684736 | 9253.896 |
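As observed, the saveRDS() and readRDS() functions were also able to save and return the re-ordered 2007 European gapminder object in the same format. Because an RDS file stores the R object itself, the factor columns and their level order are kept as well.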
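The same kind of check can be run for the RDS route with base R only. The data are invented and written to a temporary file; `toy`, `rds_path` and `back` are hypothetical names.

```r
# Toy data with a deliberately non-alphabetical level order
toy <- data.frame(
  country = factor(c("Spain", "France", "Italy"),
                   levels = c("Spain", "France", "Italy")),
  lifeExp = c(80.9, 80.7, 80.5)
)

rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
back <- readRDS(rds_path)

# The object comes back unchanged, factor levels included
identical(back, toy)     # TRUE
levels(back$country)     # "Spain" "France" "Italy"
```

This makes RDS the safer choice when the file will only be read by R and the factor level order matters.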
For this part, I will take the scatterplot from Figure 2 and add labels, titles, better background use, better legend placement, etc to give it a cleaner and more presentable look.
# Create point plot with lowest to highest population in Europe, 2007
gm_Europe_2007 %>%
  mutate(country = fct_reorder(country, pop)) %>%
  ggplot(aes(pop, country)) +
  geom_point(aes(colour=country)) +
  scale_x_continuous(breaks = 0:8 * (10^7)) + # x-axis break every 10^7 units
  theme_bw() + # give the graph a white background
  labs(x = "Population", y = "Country", # add labels
       title = "Scatterplot of Increasing Population for European Countries, in 2007",
       color='Legend: Country',
       caption = "Figure 5. Increasing Population of European Countries in 2007") +
  theme(legend.position="bottom") # put legend to bottom
Since the Figure 2 plot cannot be processed by plotly, I will create a separate scatterplot of life expectancy against GDP per capita, with population mapped to colour. The resulting scatter will be faceted by continent via facet_wrap() before being turned into a plotly object. The data will be based on the gapminder dataset: all countries, excluding those in Oceania.
library(scales) # For scale functions below
plotly_prep <- ggplot(gm_filter_O, aes(gdpPercap, lifeExp)) +
  geom_point(aes(colour=pop), alpha=0.1) + # scatter of lifeExp and gdpPercap
  scale_x_log10(labels=dollar_format()) + # log10 x-axis with dollar labels, to make the relationship more linear
  scale_colour_viridis_c(
    trans = "log10", # log10 transformation of the colour scale (population)
    breaks = 10^(1:10), # colour-scale breaks at powers of 10
    labels = comma_format() # add commas to the colour-scale labels
  ) +
  scale_y_continuous(breaks=10*(1:10)) + # y-axis breaks every 10 units
  facet_wrap(~ continent) + # break into separate scatters by continent
  theme_bw() + # give the graph a white background
  labs(x = "GDP per Capita", y = "Life Expectancy (in Years)",
       title = "Life Expectancy vs. GDP per capita, by Continent",
       color='Population',
       caption = "Figure 6. Life Expectancy (in Years) vs. GDP per capita, by Continent") + # add labels to axis, title, legend
  theme(strip.background = element_rect(fill = "yellow"),
        legend.position = "none") # facet strip bg color, leave out legend
print(plotly_prep)
Lastly, we will pass the scatterplot to the ggplotly() function from plotly and see what new information we can get from a plotly object.
library(plotly)
ggplotly(plotly_prep)
# Could not find a way to remove the overlay of x-axis and y-axis on the axis units
As we can see, plotly allows light interaction with the plot: hovering over a data point shows its coordinates, whereas the static ggplot2 scatter only displays the points. I had issues with the axis titles overlapping the axis units.
For this section, I will use the ggsave() function to save a ggplot into the project directory. The same plot will be saved in raster format (.jpg) and vector format (.svg).
# Save as raster image (jpg)
ggsave("Figure 6 ggplot.jpg", plot = plotly_prep, units = "cm", height = 10, width = 12)
The file has been saved as a .jpg file with dimensions of 12 cm by 10 cm. Sizing can be adjusted through height and width, and other units are available, such as inches (“in”) or millimetres (“mm”).
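To illustrate the units argument, the sketch below saves the same plot twice at the same physical size, once in centimetres and once in inches. It uses the built-in mtcars data and temporary PNG files rather than the report's own figure, so `p`, `cm_file` and `in_file` are hypothetical names.

```r
library(ggplot2)

# A throwaway plot on built-in data, standing in for the report's figure
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

cm_file <- file.path(tempdir(), "demo_cm.png")
in_file <- file.path(tempdir(), "demo_in.png")

# 12 cm x 10 cm, given directly in centimetres
ggsave(cm_file, plot = p, width = 12, height = 10, units = "cm")

# The same size expressed in inches (1 in = 2.54 cm)
ggsave(in_file, plot = p, width = 12 / 2.54, height = 10 / 2.54, units = "in")

file.exists(c(cm_file, in_file))
```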
Here is the code to save the ggplot as a .svg file.
# Save as vector image (svg)
library(svglite)
ggsave("Figure 6 ggplot.svg", plot = plotly_prep, units = "cm", height = 10, width = 12)
Here is the code to load the saved .jpg and .svg files:

